Abstract

For linear models with a diverging number of parameters, it has recently
been shown that modified versions of the Bayesian information criterion (BIC) can
identify the true model consistently. However, in many cases there is little
justification that the effects of the covariates are actually linear. Thus a
semiparametric model, such as the additive model studied here, is a viable
alternative. We demonstrate that theoretical results on the consistency of BIC-type
criteria can be extended to this more challenging situation, with dimension
diverging exponentially fast with sample size. Moreover, the noise assumptions
are relaxed in our theoretical studies. These efforts significantly enlarge the
applicability of the criterion to a more general class of models.

Keywords: Bayesian information criterion (BIC); Selection consistency; 
Sparsity; Ultra-high dimensional models; Variable selection. 

1 Introduction 



With rapid increases in the production of large dimensional data by modern technology,
more and more studies have focused on variable selection problems where the
goal is to identify the few relevant predictors among a large collection of predictors,
which might even outnumber the sample size due to the constraint of experimental
costs. For example, in microarray experiments investigating genetic mechanisms of
a certain disease, thousands of genes are assayed all at once while the number of
samples is constrained by the cost of arrays as well as by the rarity of the disease in
the population.

In linear models with fixed dimension, the performance of various criteria for variable
selection is well known (Shao, 1997), including AIC (Akaike, 1970), BIC (Schwarz,
1978), Cp (Mallows, 1973), etc. In particular, BIC was shown to be consistent in
variable selection. More recently, penalization approaches to variable selection have
drawn increasing attention due to their stability and computational attractiveness
(Tibshirani, 1996; Fan and Li, 2001; Zou, 2006; Yuan and Lin, 2006; Wang et al., 2011).
Following this trend, Wang et al. (2007) showed that BIC computed along
the solution path of the penalized estimator is also selection consistent.

Nevertheless, these traditional criteria are too liberal for regression problems with
high dimensional covariates, in that they tend to incorporate many spurious covariates
in the model selected. On the positive side, modifications of BIC that use
a statistically motivated larger penalty term can successfully address this problem,
make the criterion provably consistent, and exhibit satisfactory performance in real
applications (Wang et al., 2009; Chen and Chen, 2008). Despite these efforts, the
works mentioned above, particularly the theoretical investigations, entirely focused
on parametric linear models with Gaussian noise, while in many applications there is
little a priori justification that the covariates actually have such simple linear effects
on the responses.



The additive model introduced by Stone (1985) represents a more flexible class
of semiparametric models that allows a general transformation of each covariate to
enter as an additive component. This raises an interesting question: is there an
appropriately modified BIC-type criterion that can consistently identify the nonzero
components in this class of semiparametric models? Although a similar question has
been answered in the affirmative in Wang and Xia (2009) for fixed-dimensional
varying-coefficient models, it remains a conjecture for high dimensional semiparametric
problems. We note that Huang et al. (2010) used a modified BIC-type criterion
to select the tuning parameter of the group LASSO penalty for additive models, but
they did not establish the theoretical properties of such a criterion. Compared
to parametric models, the approximation errors for the component functions pose
additional challenges to our analysis.

In this paper, we investigate the theoretical properties of the BIC-type criterion in
additive models with the number of components $p$ growing much faster than the sample
size $n$. To be more specific, we assume $\log p = o(n^{2d/(2d+1)})$, where $d$ characterizes the
smoothness (roughly the number of derivatives) of the component functions. Following
the existing literature, we say the problem has ultra-high dimensionality. On
the other hand, the number of truly nonzero components is assumed to be fixed and
does not diverge with sample size, for the same reason as discussed in Huang et al.
(2010). Besides, although we acknowledge that it might be restrictive to assume
that all components have the same smoothness, it would be hard, if not impossible,
to satisfactorily deal with the more general case. Finally, it is worth noting that
we relax the Gaussian noise assumption used in Chen and Chen (2008) and Wang et al.
(2009) to sub-Gaussian noise. The Gaussian assumption was key to making the theoretical
analysis tractable in those studies (see for example (B.3) in Wang et al. (2009)).
With sub-Gaussian noise, we need to resort to studying the tail probability of some
quadratic forms involving sub-Gaussian random variables.



2 Bayesian Information Criterion for Unpenalized 
Polynomial Spline Estimators 

Consider regression problems with observations $(Y_i, X_i)$, $i = 1, \ldots, n$, that are
independent and identically distributed (i.i.d.) as $(Y, X)$, where $Y$ is a scalar response and
$X = (X_1, \ldots, X_p)^T$ contains $p$ covariates. Substantial progress has been made on
linear regression when $p$ is large, with or without penalty. Since fitting fully nonparametric
models is infeasible for large dimensions, an elegant way to relax the strong linearity
assumption, known as the additive model (Stone, 1985; Hastie and Tibshirani, 1990),
was proposed to avoid this difficulty. It is specified by

$$Y_i = \mu + \sum_{j=1}^{p} f_j(X_{ij}) + \epsilon_i, \qquad (1)$$

where $\mu$ is the intercept, $f_j$ are unknown univariate component functions and $\epsilon_i$ are
i.i.d. mean zero noises.

Without loss of generality, we assume the distribution of $X_j$ is supported on $[0, 1]$
and also impose the condition $E f_j(X_j) = 0$ for identifiability. We use polynomial
splines to approximate the components. Let $\tau_0 = 0 < \tau_1 < \cdots < \tau_{K'} < 1 = \tau_{K'+1}$ be
a partition of $[0, 1]$ into subintervals $[\tau_k, \tau_{k+1})$, $k = 0, \ldots, K'$, with $K'$ internal knots.
We restrict our attention to equally spaced knots, although data-driven choices
can be considered, such as putting knots at certain sample quantiles of the observed
covariate values. A polynomial spline of order $q$ is a function whose restriction to
each subinterval is a polynomial of degree $q - 1$ and which is globally $q - 2$ times continuously
differentiable on $[0, 1]$. The collection of splines with a fixed sequence of knots has
a normalized B-spline basis $\{B_1(x), \ldots, B_{\tilde K}(x)\}$ with $\tilde K = K' + q$. Because of the
centering constraint $E f_j(X_j) = 0$, we instead focus on the subspace of spline functions
$S_j^0 := \{s : s = \sum_{k=1}^{\tilde K} b_{jk}B_k(x), \sum_{i=1}^{n} s(X_{ij}) = 0\}$ with basis
$\{B_{jk}(x) = B_k(x) - \sum_{i=1}^{n} B_k(X_{ij})/n,\ k = 1, \ldots, K = \tilde K - 1\}$ (the subspace is $K = \tilde K - 1$ dimensional
due to the empirical version of the constraint). Using spline expansions, we can
approximate the components by $f_j(x) \approx \sum_{k} b_{jk}B_{jk}(x)$. Note that it is possible to
specify a different $K$ for each component, but we assume they are the same for simplicity
(using the same $K$ is reasonable when all components have the same smoothness
parameter).
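The construction of the centered basis $B_{jk}$ can be sketched numerically as follows. This is a minimal illustration, assuming equally spaced internal knots, a clamped knot vector, and the Cox-de Boor recursion; the function names `bspline_basis` and `centered_design` are hypothetical and not part of the original paper.

```python
import numpy as np


def bspline_basis(x, n_internal, order):
    """Evaluate the n_internal + order B-splines of the given order on [0, 1].

    Assumes equally spaced internal knots and uses the Cox-de Boor recursion.
    """
    x = np.asarray(x, dtype=float)
    internal = np.linspace(0.0, 1.0, n_internal + 2)[1:-1]
    # Repeat each boundary knot `order` times (clamped knot vector).
    knots = np.concatenate([np.zeros(order), internal, np.ones(order)])
    n_basis = n_internal + order

    # Order-1 (piecewise constant) splines: indicators of the knot spans.
    basis = np.zeros((x.size, len(knots) - 1))
    for k in range(len(knots) - 1):
        basis[:, k] = (knots[k] <= x) & (x < knots[k + 1])
    # Half-open spans miss x = 1, so assign it to the last non-empty span.
    basis[x == 1.0, n_basis - 1] = 1.0

    # Raise the degree one step at a time.
    for d in range(1, order):
        raised = np.zeros((x.size, len(knots) - 1 - d))
        for k in range(len(knots) - 1 - d):
            left = knots[k + d] - knots[k]
            right = knots[k + d + 1] - knots[k + 1]
            if left > 0:
                raised[:, k] += (x - knots[k]) / left * basis[:, k]
            if right > 0:
                raised[:, k] += (knots[k + d + 1] - x) / right * basis[:, k + 1]
        basis = raised
    return basis


def centered_design(x, n_internal, order):
    """Empirically centered basis: drop one spline, subtract column means."""
    full = bspline_basis(x, n_internal, order)
    kept = full[:, :-1]  # K = (n_internal + order) - 1 columns remain
    return kept - kept.mean(axis=0)
```

Because the full B-spline basis sums to one at every point, the centered versions of all basis functions are linearly dependent, which is why one function is dropped and the centered subspace has dimension $K' + q - 1$.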

Suppose the true components are $f_{0j}$, $1 \le j \le p$, and the true intercept is denoted
by $\mu_0$. We consider a sparse model where only the first $s$ components are nonzero.
In unpenalized estimation, the following least squares estimation procedure is used
to find the spline coefficients:

$$(\hat\mu, \hat b) = \arg\min_{\mu, b} \sum_{i=1}^{n} \Big(Y_i - \mu - \sum_{j=1}^{p}\sum_{k=1}^{K} b_{jk}B_{jk}(X_{ij})\Big)^2. \qquad (2)$$

However, the resulting estimator cannot be consistent when $p$ diverges at a sufficiently
fast rate. Thus, we restrict our search to submodels where at most $M$ components
are nonzero, where $M$ is a known fixed upper bound for $s$, and perform least squares
regression with no more than $M$ components in (2). A similar constraint is also imposed
in Chen and Chen (2008) for linear models.



Let

$$Z_j = \begin{pmatrix} B_{j1}(X_{1j}) & B_{j2}(X_{1j}) & \cdots & B_{jK}(X_{1j}) \\ \vdots & \vdots & & \vdots \\ B_{j1}(X_{nj}) & B_{j2}(X_{nj}) & \cdots & B_{jK}(X_{nj}) \end{pmatrix}_{n \times K},$$

$Z = (Z_1, \ldots, Z_p)$ and $Y = (Y_1, \ldots, Y_n)^T$. For any submodel indicated by $S \subseteq \{1, \ldots, p\}$,
let $Z_S$ be the submatrix of $Z$ containing the columns in $S$, and similarly define $b_S$, $\hat b_S$,
etc. For notational convenience, we add $(1, \ldots, 1)^T/\sqrt{K}$ as the first column of $Z$ and $Z_S$, and
define $a = (\sqrt{K}\mu, b^T)^T$, $a_S = (\sqrt{K}\mu, b_S^T)^T$, such that for the submodel $S$, (2) can be
written in matrix form as

$$\hat a_S = \arg\min_{a_S} \|Y - Z_S a_S\|^2. \qquad (3)$$

Let the true model be indicated by $S_0 = \{1, \ldots, s\}$.

Now we can define the BIC-type criterion for the semiparametric model as

$$\mathrm{BIC}(S) = \log(\|Y - Z_S\hat a_S\|^2) + |S|K\,\frac{\log n + \log p}{n}, \qquad (4)$$

where $|S|$ is the size of the set $S$. The submodel $\hat S$ that achieves the minimum value of
the above (over all submodels with $|S| \le M$) is chosen as the final model. The form of
the above penalty is the same as that used in Huang et al. (2010) for the group adaptive
LASSO estimator, which is slightly different from that of Chen and Chen (2008), but
easily seen to be asymptotically equivalent since $\log\binom{p}{j} \sim j\log p$, $j \le M$. The
penalty in Wang et al. (2009), adapted to the semiparametric context here, is of the
form $C_n|S|K\log n/n$ for some $C_n \to \infty$. We will try to be slightly more general and
present our theoretical results for a general penalty term denoted by $\mathrm{pen}(S)$.
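As a rough illustration of criterion (4) and of the constrained search over submodels with $|S| \le M$, the following sketch computes $\mathrm{BIC}(S)$ by ordinary least squares and enumerates all candidate submodels. It is only feasible for small $p$. The names `bic_submodel` and `best_submodel`, and the representation of the design as a list of $n \times K$ blocks, are assumptions made for this example.

```python
import itertools
import numpy as np


def bic_submodel(Y, blocks, S):
    """BIC(S) = log(RSS_S) + |S| K (log n + log p) / n, as in (4).

    `blocks[j]` is the n x K centered spline design for covariate j.
    An unscaled intercept column is used; rescaling it by 1/sqrt(K)
    as in the text does not change the fitted values.
    """
    n, K = blocks[0].shape
    p = len(blocks)
    Z_S = np.hstack([np.ones((n, 1))] + [blocks[j] for j in S])
    coef, *_ = np.linalg.lstsq(Z_S, Y, rcond=None)
    rss = float(np.sum((Y - Z_S @ coef) ** 2))
    penalty = len(S) * K * (np.log(n) + np.log(p)) / n
    return np.log(rss) + penalty, rss


def best_submodel(Y, blocks, M):
    """Brute-force minimizer of BIC over all submodels of size at most M."""
    p = len(blocks)
    candidates = (S for size in range(M + 1)
                  for S in itertools.combinations(range(p), size))
    return min(candidates, key=lambda S: bic_submodel(Y, blocks, S)[0])
```

The enumeration visits on the order of $p^M$ submodels, which is the computational burden that motivates the penalized estimators of Section 3.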
The following technical conditions are assumed. 

(c1) The covariate vector $X$ has a continuous density supported on $[0, 1]^p$. Furthermore,
the marginal densities of $X_j$, $1 \le j \le p$, are all bounded from below and
above by two fixed positive constants respectively.

(c2) The mean zero noises $\epsilon_i$ are independent of the covariates, have variance $\sigma^2$, and
are sub-Gaussian. That is, there exists some $a > 0$ such that $E[\exp\{t\epsilon\}] \le \exp\{t^2a^2/2\}$.

(c3) $f_{0j}$, $1 \le j \le s$, satisfies a Lipschitz condition of order $d > 1/2$:
$|f_{0j}^{(\lfloor d\rfloor)}(t) - f_{0j}^{(\lfloor d\rfloor)}(s)| \le C|s - t|^{d - \lfloor d\rfloor}$, where $\lfloor d\rfloor$ is the biggest integer strictly smaller than $d$
and $f_{0j}^{(\lfloor d\rfloor)}$ is the $\lfloor d\rfloor$-th derivative of $f_{0j}$. The order of the B-spline used satisfies
$q > d + 2$.

(c4) The number of nonzero components is $s = O(1)$.

(c5) $K\log(pn)/n \to 0$, $K \to \infty$, $K\log(pn)/n + K^{-2d} = o(\min_{1\le j\le s}\|f_{0j}\|^2)$,
$\mathrm{pen}(S_0) = o(\min_{1\le j\le s}\|f_{0j}\|^2)$, $K^{-2d} = o(\mathrm{pen}(S) - \mathrm{pen}(S_0))$ for $S \supsetneq S_0$, and
$K\log(pn)/n = O(\mathrm{pen}(S) - \mathrm{pen}(S_0))$ for $S \supsetneq S_0$.
Most of the assumptions are standard in the literature. Assumptions (c1)-(c4) are
also assumed in Huang et al. (2010). However, we will not assume that $\min_{1\le j\le s}\|f_{0j}\|$
is bounded away from zero as in assumption (A1) of Huang et al. (2010). Instead,
(c5) makes it clear that this quantity is allowed to converge to zero at a certain rate.
Also note that in previous studies on the consistency of BIC-type criteria in linear
models, Gaussian noise is assumed. We relax this assumption at the cost of more
sophisticated arguments. We collect the assumptions on the convergence/divergence
rates of different quantities in (c5). The expressions in (c5) can be simplified when
$K \sim n^{1/(2d+1)}$ (this is the theoretically optimal choice of $K$ that balances bias and
variance (Stone, 1985)) and $\mathrm{pen}(S) = |S|K(\log n + \log p)/n$ (see Corollary 1 below).



Theorem 1 Assume conditions (c1)-(c5). Then

$$P(\hat S = S_0) \to 1.$$

By this theorem, with probability tending to 1, no model of size at most $M$ other
than the true one is selected by the BIC-type criterion. For the particular form of
the penalty function stated above, we have the following corollary.

Corollary 1 If $K \sim n^{1/(2d+1)}$, $\log p = o(n^{2d/(2d+1)})$, and $\min_{1\le j\le s}\|f_{0j}\|^2 \gg (\log(pn))\,n^{-2d/(2d+1)}$,
then under conditions (c1)-(c4) the BIC-type criterion defined in (4) is selection
consistent.
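The rate conditions in (c5) can be checked directly for the choices made in Corollary 1. The following worked verification is a sketch reconstructed from the stated conditions rather than text taken from the original proof; $C$ denotes a generic positive constant.

```latex
% With K \sim n^{1/(2d+1)} we have K/n \sim K^{-2d} \sim n^{-2d/(2d+1)},
% and pen(S) = |S| K \log(pn) / n.
\begin{align*}
\frac{K\log(pn)}{n} &\sim n^{-2d/(2d+1)}\log(pn) \to 0
  && \text{since } \log p = o\big(n^{2d/(2d+1)}\big),\\
\frac{K\log(pn)}{n} + K^{-2d} &= O\big(n^{-2d/(2d+1)}\log(pn)\big)
  = o\Big(\min_{1\le j\le s}\|f_{0j}\|^2\Big),\\
\mathrm{pen}(S_0) &= \frac{sK\log(pn)}{n}
  = o\Big(\min_{1\le j\le s}\|f_{0j}\|^2\Big)
  && \text{since } s = O(1),\\
\mathrm{pen}(S)-\mathrm{pen}(S_0) &= (|S|-s)\,\frac{K\log(pn)}{n} \ge \frac{K\log(pn)}{n}
  && \text{for } S \supsetneq S_0,\\
\frac{K^{-2d}}{\mathrm{pen}(S)-\mathrm{pen}(S_0)} &\le \frac{C}{\log(pn)} \to 0
  && \text{for } S \supsetneq S_0 .
\end{align*}
```

The fourth line gives $K\log(pn)/n = O(\mathrm{pen}(S) - \mathrm{pen}(S_0))$ and the fifth gives $K^{-2d} = o(\mathrm{pen}(S) - \mathrm{pen}(S_0))$, so every requirement in (c5) holds and Theorem 1 applies.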

3 Bayesian Information Criterion for Penalized Estimators

In the last section we stated that the BIC-type criterion is consistent for variable selection
with unpenalized estimators. However, even when the size of the submodels under
consideration is constrained by $M$, brute-force search is still infeasible for large $p$.
This is one of the reasons why penalized estimators have become so popular in recent
years. Here we briefly discuss how the results in the previous section can be extended
to penalized estimators.

In our context, the penalized estimator is defined by

$$\hat a_\lambda = \arg\min_{a} \|Y - Za\|^2 + \sum_{j=1}^{p} p_\lambda(\|b_j\|), \qquad (5)$$

where $\lambda$ is the tuning parameter controlling the sparsity of the solution, with larger
$\lambda$ resulting in more components estimated as zero. Let $S_\lambda = \{j : \hat b_{\lambda j} \ne 0\}$ be the
submodel represented by $\hat a_\lambda$. Here we focus on the group adaptive LASSO penalty
since this is the one studied in Huang et al. (2010) for ultra-high dimensional additive
models, although we expect that selection consistency for estimators with the SCAD penalty
(Fan and Li, 2001) or MCP (Zhang, 2010) can be derived in a similar way. Thus we
assume all the conditions in Huang et al. (2010). The BIC-type criterion for the penalized
estimator is defined as

$$\mathrm{BIC}(\lambda) = \log(\|Y - Z\hat a_\lambda\|^2) + \mathrm{pen}(S_\lambda),$$

and the optimal tuning parameter is $\hat\lambda = \arg\min_{\lambda > 0} \mathrm{BIC}(\lambda)$.
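The selection of $\hat\lambda$ can be sketched as a search over a finite grid of tuning parameters, given coefficient vectors already computed by some penalized solver. In the sketch below the solver itself is not shown. The names `bic_lambda` and `select_lambda`, the dictionary `path` mapping each $\lambda$ to a fitted coefficient vector, the omission of the intercept (the response is assumed centered), and the choice $\mathrm{pen}(S) = |S|K(\log n + \log p)/n$ are all assumptions made for illustration.

```python
import numpy as np


def bic_lambda(Y, Z, coef, K, p):
    """BIC(lambda) = log(RSS) + |S_lambda| K (log n + log p) / n.

    `coef` has length p * K, ordered group by group; S_lambda is the set
    of groups whose coefficient block is not identically zero.
    """
    n = Y.shape[0]
    groups = coef.reshape(p, K)
    size = int(np.sum(np.linalg.norm(groups, axis=1) > 0))
    rss = float(np.sum((Y - Z @ coef) ** 2))
    return np.log(rss) + size * K * (np.log(n) + np.log(p)) / n


def select_lambda(Y, Z, path, K, p):
    """Return the lambda on the supplied path that minimizes BIC(lambda)."""
    return min(path, key=lambda lam: bic_lambda(Y, Z, path[lam], K, p))
```

Only as many fits as there are grid points need to be evaluated, in contrast to the enumeration of all submodels of size at most $M$ required by the unpenalized criterion.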



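Following Huang et al. (2010), for the group adaptive LASSO estimator, the
penalty term in (5) is $\sum_{j=1}^{p} \lambda\|b_j\|/\|\tilde b_j\|$, where $\tilde b_j$ is the initial group LASSO
estimator. The following discussions are mainly extensions of the arguments in
Wang et al. Based on Corollary 2 in Huang et al. (2010), if $K \sim n^{1/(2d+1)}$ and the tuning
parameter is chosen to be $\lambda_n \sim \sqrt{n}$, the estimator $\hat a_{\lambda_n}$ represents the correct model
(that is, $\hat b_{\lambda_n j} = 0$ for $j > s$, or in other words $S_{\lambda_n} = S_0$). Since $\hat b_{\lambda_n j} = 0$ for $j > s$,
$\hat a_{\lambda_n S_0} = (\sqrt{K}\hat\mu_{\lambda_n}, \hat b_{\lambda_n 1}^T, \ldots, \hat b_{\lambda_n s}^T)^T$ must be the minimizer of

$$\|Y - Z_{S_0}a\|^2 + \sum_{j=1}^{s} \lambda_n\frac{\|b_j\|}{\|\tilde b_j\|},$$

which yields by the first order condition $\hat a_{\lambda_n S_0} = (Z_{S_0}^T Z_{S_0})^{-1}(Z_{S_0}^T Y + \nu)$, where

$$\nu = -\frac{\lambda_n}{2}\Big(0, \frac{\hat b_{\lambda_n 1}^T}{\|\hat b_{\lambda_n 1}\|\,\|\tilde b_1\|}, \ldots, \frac{\hat b_{\lambda_n s}^T}{\|\hat b_{\lambda_n s}\|\,\|\tilde b_s\|}\Big)^T.$$

We have $\|\nu\|^2 = O(\lambda_n^2/K) = O(n/K)$. Thus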



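$$\begin{aligned}
\|Y - Z\hat a_{\lambda_n}\|^2 - \|Y - Z_{S_0}\hat a_{S_0}\|^2
&= \|Z_{S_0}(Z_{S_0}^T Z_{S_0})^{-1}\nu\|^2 - 2(Y - P_{S_0}Y)^T Z_{S_0}(Z_{S_0}^T Z_{S_0})^{-1}\nu \\
&= O\big((K/n)\|\nu\|^2 + \sqrt{K/n}\,\sqrt{n/K^{2d}}\,\|\nu\|\big) \\
&= O(\sqrt{K} + n/K^{2d}).
\end{aligned}$$

Thus, recalling that $S_{\lambda_n} = S_0$,

$$\begin{aligned}
\mathrm{BIC}(\lambda) - \mathrm{BIC}(\lambda_n)
&= \log(\|Y - Z\hat a_\lambda\|^2) - \log(\|Y - Z\hat a_{\lambda_n}\|^2) + \mathrm{pen}(S_\lambda) - \mathrm{pen}(S_{\lambda_n}) \\
&\ge \log(\|Y - Z\hat a_{S_\lambda}\|^2) - \log(\|Y - Z\hat a_{\lambda_n}\|^2) + \mathrm{pen}(S_\lambda) - \mathrm{pen}(S_{\lambda_n}) \\
&= \log(\|Y - Z\hat a_{S_\lambda}\|^2) - \log(\|Y - Z\hat a_{S_{\lambda_n}}\|^2) + \mathrm{pen}(S_\lambda) - \mathrm{pen}(S_{\lambda_n}) + O(\sqrt{K} + n/K^{2d}) \\
&= \mathrm{BIC}(S_\lambda) - \mathrm{BIC}(S_0) + O(\sqrt{K} + n/K^{2d}).
\end{aligned}$$

A look at the proof of Theorem 1 in the Appendix shows that when $S_\lambda \ne S_0$ the
gap between $\mathrm{BIC}(S_\lambda)$ and $\mathrm{BIC}(S_0)$ is actually larger than $O(\sqrt{K} + n/K^{2d})$, so the
$O(\sqrt{K} + n/K^{2d})$ term does not affect the result and we still have $\mathrm{BIC}(\lambda) - \mathrm{BIC}(\lambda_n) > 0$
with probability tending to 1 uniformly over all $\lambda$ such that $S_\lambda \ne S_0$
and $|S_\lambda| \le M$.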

4 Conclusion and Discussion 

In this paper, we showed that the BIC-type criterion can be used in additive models
with ultra-high feature dimensions to consistently select the true model. This paper
is mainly of theoretical interest, and numerical evidence of its performance is already
contained in Huang et al. (2010). Although the BIC-type criterion is consistent
for both unpenalized and penalized estimators, computational constraints imply
that the latter should be used in practice to avoid brute-force search over submodels.
When the dimension of the feature space is so high that penalized approaches
cannot be directly applied for computational reasons, a nonparametric independence
screening procedure (Fan et al., 2011) can be used as a first step to reduce the
dimensionality.

The BIC-type criterion for the penalized estimator focuses on the choice of the tuning
parameter $\lambda$ and ignores the choice of $K$ (the number of knots in the B-spline
approximation). In practice, $K$ can be fixed to a reasonable integral value as done in
Yu and Ruppert (2002), Huang et al. (2010) and Fan et al. (2011), and some sensitivity
analysis might be justified. It remains an open problem whether some criterion exists
for data-driven choice of $K$ in high-dimensional contexts that has the desired
theoretical property (in particular, results in $K \sim n^{1/(2d+1)}$).



Appendix: Proofs 



By well-known properties of B-splines, there exists $b_{0j} = (b_{0j1}, \ldots, b_{0jK})^T$ that
satisfies the approximation property $\|\sum_{k} b_{0jk}B_{jk}(x) - f_{0j}(x)\|_\infty = O(K^{-d})$. Let
$a_0 = (\sqrt{K}\mu_0, b_{01}^T, \ldots, b_{0p}^T)^T$ and similarly define $a_{0S}$ for a submodel $S$. In our proofs, $C$
denotes a generic positive constant. We first present a lemma which will be useful
in the proof of the theorem.



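Lemma 1

$$\sup_{S \supseteq S_0 : |S| \le M} \big|\, \|Y - Z_S\hat a_S\|^2 - \|Y - Z_S a_{0S}\|^2 \,\big| = O(nK^{-2d}) + o(K\log(pn)).$$

Proof of Lemma 1. We have

$$\begin{aligned}
\|Y - Z_S\hat a_S\|^2 - \|Y - Z_S a_{0S}\|^2
&= -2(Y - Z_S a_{0S})^T Z_S(\hat a_S - a_{0S}) + \|Z_S(\hat a_S - a_{0S})\|^2 \\
&= -2\epsilon^T Z_S(\hat a_S - a_{0S}) - 2(f_0(X) - Z_S a_{0S})^T Z_S(\hat a_S - a_{0S}) + \|Z_S(\hat a_S - a_{0S})\|^2, \qquad (6)
\end{aligned}$$

where $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$ and $f_0(X) = (f_0(X_1), \ldots, f_0(X_n))^T$ with
$f_0(X_i) = \mu_0 + \sum_{j=1}^{s} f_{0j}(X_{ij})$ being the true regression function evaluated at covariate $X_i$.

By definition we have $\hat a_S = (Z_S^T Z_S)^{-1}Z_S^T(Z_S a_{0S} + (f_0(X) - Z_S a_{0S}) + \epsilon)$ and thus
$\hat a_S - a_{0S} = (Z_S^T Z_S)^{-1}Z_S^T(f_0(X) - Z_S a_{0S}) + (Z_S^T Z_S)^{-1}Z_S^T\epsilon$. Plugging this expression
into (6) we get

$$\begin{aligned}
\|Y - Z_S\hat a_S\|^2 - \|Y - Z_S a_{0S}\|^2
&= -2\epsilon^T P_S\epsilon - 4\epsilon^T P_S(f_0(X) - Z_S a_{0S}) \\
&\quad - 2(f_0(X) - Z_S a_{0S})^T P_S(f_0(X) - Z_S a_{0S}) + \|P_S\epsilon + P_S(f_0(X) - Z_S a_{0S})\|^2 \\
&= O\big(\epsilon^T P_S\epsilon + (f_0(X) - Z_S a_{0S})^T P_S(f_0(X) - Z_S a_{0S})\big), \qquad (7)
\end{aligned}$$

where $P_S = Z_S(Z_S^T Z_S)^{-1}Z_S^T$ is a projection matrix.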

Obviously $(f_0(X) - Z_S a_{0S})^T P_S(f_0(X) - Z_S a_{0S}) = O(nK^{-2d})$. Next we will show
$\sup_{S : |S| \le M} \epsilon^T P_S\epsilon = o(K\log(pn))$. Since we do not assume the errors are Gaussian, the
quadratic form cannot be written as a sum of chi-squared random variables. Fortunately
we can still resort to results on quadratic forms for sub-Gaussian random variables.
Specifically, by Proposition 1.1 in Mikosch (1991), when $y > Ka^2$, we have

$$P(\epsilon^T P_S\epsilon > a^2(MK + y)) \le \exp\{-Cy/a^2\},$$

and thus

$$P\Big(\sup_{S : |S| \le M} \epsilon^T P_S\epsilon > a^2(MK + y)\Big) \le O(p^M)\exp\{-Cy/a^2\},$$

and if one takes $y = \delta K\log(pn)$ for any $\delta > 0$, the above probability will tend to 0.
This shows $\sup_{S : |S| \le M} \epsilon^T P_S\epsilon = o(K\log(pn))$. $\Box$
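The union bound and the limit used in the last step can be spelled out as follows. This is a sketch filling in the intermediate arithmetic under the stated assumptions ($M$ fixed, $K \to \infty$); it is not part of the original proof.

```latex
\begin{align*}
\#\{S : |S| \le M\} &= \sum_{j=0}^{M}\binom{p}{j} \le (M+1)\,p^{M} = O(p^{M}),\\
O(p^{M})\exp\{-Cy/a^{2}\}\Big|_{y=\delta K\log(pn)}
  &= O\Big(\exp\Big\{\Big(M - \frac{C\delta K}{a^{2}}\Big)\log p
      - \frac{C\delta K}{a^{2}}\log n\Big\}\Big) \to 0,\\
a^{2}(MK + y)\Big|_{y=\delta K\log(pn)}
  &= a^{2}\Big(\frac{M}{\log(pn)} + \delta\Big)K\log(pn).
\end{align*}
```

The second line tends to zero because $K \to \infty$ eventually makes $C\delta K/a^2 \ge M$ while $(C\delta K/a^2)\log n \to \infty$. Since $\delta > 0$ is arbitrary, the third line shows that the threshold is $o(K\log(pn))$.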

Proof of Theorem 1. The proof is split into two parts, considering the underfitted
models (some nonzero components are not in $S$) and overfitted models (some
zero components, as well as all nonzero components, are included in $S$) respectively.

Part 1: $S_0 \not\subseteq S$.

Let $\hat a_S$ and $\hat a_{S_0}$ be the least squares estimators under the submodel $S$ and the true
model $S_0$ respectively. Let $\tilde S = S \cup S_0$. With abuse of notation, $\hat a_S$ is also used to
denote the $(|\tilde S|K + 1)$-dimensional vector where the coefficients not associated with the submodel
$S$ are filled in by zero. A similar statement applies to other notations such as $\hat a_{S_0}$, $a_{0S_0}$,
etc. Thus we can write expressions such as $Z_{\tilde S}\hat a_S$ even though $S \ne \tilde S$. That is, zero
values are filled in to match the dimension. Then we have

$$\begin{aligned}
\|Y - Z_{\tilde S}\hat a_S\|^2 - \|Y - Z_{\tilde S}\hat a_{S_0}\|^2
&= -2(Y - Z_{\tilde S}\hat a_{S_0})^T Z_{\tilde S}(\hat a_S - \hat a_{S_0}) + \|Z_{\tilde S}(\hat a_S - \hat a_{S_0})\|^2 \\
&= -2\epsilon^T Z_{\tilde S}(\hat a_S - \hat a_{S_0}) + 2(Z_{\tilde S}\hat a_{S_0} - f_0(X))^T Z_{\tilde S}(\hat a_S - \hat a_{S_0}) + \|Z_{\tilde S}(\hat a_S - \hat a_{S_0})\|^2. \qquad (8)
\end{aligned}$$

By existing results on spline estimators in additive models (Stone, 1985), we know
that when the true model is known, $\|\hat a_{S_0} - a_{0S_0}\| = O(K/\sqrt{n} + K^{-d+1/2})$. Besides, since
some nonzero components in $a_{0S_0}$ are estimated as zero in $\hat a_S$, we know
$\|\hat a_S - a_{0S_0}\| \ge \min_{1\le j\le s}\|b_{0j}\| \ge C\sqrt{K}(\min_{1\le j\le s}\|f_{0j}\| - K^{-d})$ by the approximation property of splines.
Thus uniformly for all $S \not\supseteq S_0$,

$$\|\hat a_S - \hat a_{S_0}\| \ge \|\hat a_S - a_{0S_0}\| - \|a_{0S_0} - \hat a_{S_0}\| \ge C\big(\sqrt{K}\min_{1\le j\le s}\|f_{0j}\| - K/\sqrt{n} - K^{-d+1/2}\big).$$
Denote the right hand side above by $\gamma_n$; then the third term in (8) is bounded below
by $C(n/K)\gamma_n^2$ by Lemma 3 in Huang et al. (2010). The absolute value of the second
term is bounded by $\sqrt{nK^{-2d}}\sqrt{n/K}\,\gamma_n$ and thus is of smaller order than the third
term. Finally we bound the first term in (8) by

$$|\epsilon^T Z_{\tilde S}(\hat a_S - \hat a_{S_0})| \le \sqrt{\epsilon^T P_{\tilde S}\epsilon}\;\|Z_{\tilde S}(\hat a_S - \hat a_{S_0})\|.$$

In the proof of Lemma 1 we showed that $\sup_{S : |S| \le M} \epsilon^T P_S\epsilon = o(K\log(pn))$, and thus by
condition (c5), (8) is bounded below by a positive number at least as large as $C(n/K)\gamma_n^2$.
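We have

$$\mathrm{BIC}(S) - \mathrm{BIC}(S_0) = \log\Big(1 + \frac{\|Y - Z_S\hat a_S\|^2/n - \|Y - Z_{S_0}\hat a_{S_0}\|^2/n}{\|Y - Z_{S_0}\hat a_{S_0}\|^2/n}\Big) + \mathrm{pen}(S) - \mathrm{pen}(S_0).$$

Lemma 1 implies that $\|Y - Z_{S_0}\hat a_{S_0}\|^2/n \ge \|Y - Z_{S_0}a_{0S_0}\|^2/n - O(K^{-2d}) - o((K/n)\log(pn)) \ge
\|\epsilon\|^2/(2n) - \|Z_{S_0}a_{0S_0} - f_0(X)\|^2/n - O(K^{-2d}) - o((K/n)\log(pn)) \to \sigma^2/2$. Thus

$$\mathrm{BIC}(S) - \mathrm{BIC}(S_0) \ge C\Big(\min_{1\le j\le s}\|f_{0j}\|^2 - K/n - K^{-2d}\Big) + \mathrm{pen}(S) - \mathrm{pen}(S_0),$$

which is positive with probability tending to 1 by (c5). Thus
$P(\min_{S \not\supseteq S_0 : |S| \le M} \mathrm{BIC}(S) - \mathrm{BIC}(S_0) > 0) \to 1$.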
Part 2: $S \supsetneq S_0$.

Lemma 1 showed that

$$\sup_{S \supseteq S_0 : |S| \le M} \big|\, \|Y - Z_S a_{0S}\|^2 - \|Y - Z_S\hat a_S\|^2 \,\big| = O(nK^{-2d}) + o(K\log(pn)), \qquad (9)$$

and noting that $Z_S a_{0S} = Z_{S_0}a_{0S_0}$ for $S_0 \subseteq S$, we have

$$\begin{aligned}
\mathrm{BIC}(S_0) - \mathrm{BIC}(S)
&\le \log\Big(\frac{\|Y - Z_{S_0}a_{0S_0}\|^2}{\|Y - Z_S\hat a_S\|^2}\Big) - (\mathrm{pen}(S) - \mathrm{pen}(S_0)) \\
&= \log\Big(1 + \frac{\|Y - Z_S a_{0S}\|^2 - \|Y - Z_S\hat a_S\|^2}{\|Y - Z_S\hat a_S\|^2}\Big) - (\mathrm{pen}(S) - \mathrm{pen}(S_0)).
\end{aligned}$$

Using (9) and similar to the arguments at the end of Part 1, $\|Y - Z_S\hat a_S\|^2/n$ is
bounded away from zero uniformly in $S \supsetneq S_0$. Thus $\mathrm{BIC}(S_0) - \mathrm{BIC}(S) \le
O(K^{-2d}) + o(K\log(pn)/n) - (\mathrm{pen}(S) - \mathrm{pen}(S_0)) < 0$ with probability tending to 1. $\Box$